Abstract

We analyze the minimax regret of the adversarial bandit convex optimization problem.
Focusing on the one-dimensional case, we prove that the minimax regret is $\tilde{\Theta}(\sqrt{T})$
and partially resolve a decade-old open problem. Our analysis is non-constructive, as 
we do not present a concrete algorithm that attains this regret rate. Instead, we use 
minimax duality to reduce the problem to a Bayesian setting, where the convex loss 
functions are drawn from a worst-case distribution, and then we solve the Bayesian 
version of the problem with a variant of Thompson Sampling. Our analysis features a 
novel use of convexity, formalized as a “local-to-global” property of convex functions, 
that may be of independent interest. 


1 Introduction 

Online convex optimization with bandit feedback, commonly known as bandit convex
optimization, can be described as a $T$-round game, played by a randomized player in an
adversarial environment. Before the game begins, the adversarial environment chooses an
arbitrary sequence of $T$ bounded convex functions $f_1, \ldots, f_T$, where each
$f_t : \mathcal{K} \to [0,1]$ and $\mathcal{K}$ is a fixed convex and compact set in
$\mathbb{R}^n$. On round $t$ of the game, the player chooses a point $X_t \in \mathcal{K}$
and incurs a loss of $f_t(X_t)$. The player observes the value of $f_t(X_t)$ and nothing else,
and she uses this information to improve her choices going forward. The player's performance
is measured in terms of her $T$-round regret, defined as
$\sum_{t=1}^T f_t(X_t) - \min_{x \in \mathcal{K}} \sum_{t=1}^T f_t(x)$. In words, the regret
compares the player's cumulative loss to that of the best fixed point in hindsight.

* Parts of this work were done while the author was at Microsoft Research, Redmond.

While regret measures the performance of a specific player against a specific loss sequence,
the inherent difficulty of the game is measured using the notion of minimax regret. Informally,
the game's minimax regret is the regret of an optimal player when she faces the worst-case
loss sequence. Characterizing the minimax regret of bandit convex optimization is one of the
most elusive open problems in the field of online learning. For general bounded convex loss
functions, Flaxman et al. (2005) present an algorithm that guarantees a regret of
$\tilde{O}(T^{5/6})$, and this is the best known upper bound on the minimax regret of the game.
Better regret rates can be guaranteed if additional assumptions are made: for Lipschitz
functions the regret is $\tilde{O}(T^{3/4})$ (Flaxman et al., 2005), for Lipschitz and strongly
convex losses the regret is $\tilde{O}(T^{2/3})$ (Agarwal et al., 2010), and for smooth
functions the regret is $\tilde{O}(T^{2/3})$ (Saha and Tewari, 2011). In all of the
aforementioned settings, the best known lower bound on minimax regret is $\Omega(\sqrt{T})$
(Dani et al., 2008), and the challenge is to bridge the gap between the upper and lower
bounds. In a few special cases, the gap is resolved and we know that the minimax regret is
exactly $\tilde{\Theta}(\sqrt{T})$; specifically, when the loss functions are both smooth and
strongly convex (Hazan and Levy, 2014), when they are Lipschitz and linear (Dani et al.,
2008; Abernethy et al., 2008), or when they are Lipschitz and drawn i.i.d. from a fixed and
unknown distribution (Agarwal et al., 2011).

In this paper, we resolve the open problem in the one-dimensional case, where
$\mathcal{K} = [0,1]$, by proving that the minimax regret with arbitrary bounded convex loss
functions is $\tilde{\Theta}(\sqrt{T})$. Formally, we prove the following theorem.

Theorem 1 (main result). There exists a randomized player strategy that relies on bandit
feedback and guarantees an expected regret of $O(\sqrt{T} \log T)$ against any sequence of
convex loss functions $f_1, \ldots, f_T : [0,1] \to [0,1]$.

The one-dimensional case has received very little special attention, and the best published
result is the $\tilde{O}(T^{5/6})$ bound mentioned above, which holds in any dimension.
However, by discretizing the domain $[0,1]$ appropriately and applying a standard multi-armed
bandit algorithm, one can prove a tighter bound of $\tilde{O}(T^{2/3})$; see Appendix D for
details. It is worth noting that replacing the convexity assumption with a Lipschitz
assumption also gives an upper bound of $\tilde{O}(T^{2/3})$ (Kleinberg, 2004). However,
obtaining the tight upper bound of $\tilde{O}(\sqrt{T})$ requires a more delicate analysis,
which is the main focus of this paper.

Our tight upper bound is non-constructive, in the sense that we do not describe an
algorithm that guarantees $\tilde{O}(\sqrt{T})$ regret for any loss sequence. Instead, we use
minimax duality to reduce the problem of bounding the adversarial minimax regret to the
problem of upper bounding the analogous maximin regret in a Bayesian setting. Unlike our
original setting, where the sequence of convex loss functions is chosen adversarially, the
loss functions in the Bayesian setting are drawn from a probability distribution, called the
prior, which is known to the player. The idea of using minimax duality to study minimax regret
is not new (see, e.g., Abernethy et al., 2009; Gravin et al., 2014); however, to the best of
our knowledge, we are the first to apply this technique to prove upper bounds in a bandit
feedback scenario.

After reducing our original problem to the Bayesian setting, we design a novel algorithm
for Bayesian bandit convex optimization (in one dimension) that guarantees
$\tilde{O}(\sqrt{T})$ regret for any prior distribution. Since our main result is
non-constructive to begin with, we are not at all concerned with the computational efficiency
of this algorithm. We first discretize the domain $[0,1]$ and treat each discrete point as an
arm in a multi-armed bandit problem. We then apply a variant of the classic Thompson Sampling
strategy (Thompson, 1933) that is designed to exploit the fact that the loss functions are all
convex. We adapt the analysis of Thompson Sampling in Russo and van Roy (2014) to our
algorithm and extend it to arbitrary joint prior distributions over sequences of loss
functions (not necessarily i.i.d. sequences).

The significance of the convexity assumption is that it enables us to obtain regret bounds
that scale logarithmically with the number of arms, which turns out to be the key property
that leads to the desired $\tilde{O}(\sqrt{T})$ upper bound. Intuitively, convexity ensures
that a change to the loss value of one arm influences the loss values in many of the adjacent
arms. Therefore, even the worst-case prior distribution cannot hide a small loss in one arm
without globally influencing the loss of many other arms. Technically, this aspect of our
analysis boils down to a basic question about convex functions: given two convex functions
$f : \mathcal{K} \to [0,1]$ and $g : \mathcal{K} \to [0,1]$ such that
$f(x) < \min_{y} g(y)$ at some point $x \in \mathcal{K}$, how small can $\|f - g\|$ be
(where $\|\cdot\|$ is an appropriate norm over the function space)? In other words, if two
convex functions differ locally, how similar can they be globally? We give an answer to this
question in the one-dimensional case.

The paper is organized as follows. We begin in Section 2 where we define the setting of
Bayesian online optimization, establish basic techniques for the analysis of Bayesian online 
algorithms, and demonstrate how to readily recover some of the known minimax regret 
bounds for the full information case by bounding the Bayesian regret. Then, in Section 3, 
we prove the key structural lemma by which we exploit the convexity of the loss functions. 
Section 4 is the main part of the paper, where we give our algorithm for Bayesian bandit 
convex optimization (in one dimension) and analyze its regret. We conclude the paper in 
Section 5 with a few remarks and open problems. 


2 From Adversarial to Bayesian Regret 

In this section, we show how regret bounds for an adversarial online optimization setting 
can be obtained via a Bayesian analysis. Before explaining this technique in detail, we first
formalize two variants of the online optimization problem: the adversarial setting and the 
Bayesian setting. 

We begin with the standard, adversarial online optimization setup. As described above,
in this setting the player plays a $T$-round game, during which she chooses a sequence of
points $X_{1:T}$,^1 where $X_t \in \mathcal{K}$ for all $t$. The player's randomized policy
for choosing $X_{1:T}$ is defined by a sequence of deterministic functions $\rho_{1:T}$, where
each $\rho_t : [0,1]^{t-1} \to \Delta(\mathcal{K})$ (here $\Delta(\mathcal{K})$ is the set of
probability distributions over $\mathcal{K}$). On round $t$, the player uses $\rho_t$ and her
past observations to define the probability distribution

$$\pi_t = \rho_t\big(f_1(X_1), \ldots, f_{t-1}(X_{t-1})\big),$$

and then draws a concrete point $X_t \sim \pi_t$. Even though $\rho_t$ is a deterministic
function, the probability distribution $\pi_t$ is itself a random variable, because it depends
on the player's random observations $f_1(X_1), \ldots, f_{t-1}(X_{t-1})$.

^1 Throughout the paper, we use the notation $a_{s:t}$ as shorthand for the sequence
$a_s, \ldots, a_t$.

The player's cumulative loss at the end of the game is the random quantity
$\sum_{t=1}^T f_t(X_t)$, and her expected regret against the sequence $f_{1:T}$ is

$$R(\rho_{1:T}; f_{1:T}) = \mathbb{E}\left[\sum_{t=1}^T f_t(X_t) - \min_{x \in \mathcal{K}} \sum_{t=1}^T f_t(x)\right].$$

The difficulty of the game is measured by its minimax regret, defined as

$$\min_{\rho_{1:T}} \sup_{f_{1:T}} R(\rho_{1:T}; f_{1:T}).$$
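To make the regret of a single played sequence concrete, the following minimal sketch evaluates it numerically. It assumes quadratic losses $f_t(x) = (x - c_t)^2$ with made-up centers $c_t$, and it approximates the minimum over $\mathcal{K} = [0,1]$ by a minimum over a finite grid; the function name `regret` and all data are illustrative and not part of the paper.

```python
import numpy as np

def regret(losses, plays, grid):
    """Regret of a played sequence against the best fixed grid point in hindsight.

    losses: list of T callables f_t mapping [0, 1] -> [0, 1]
    plays:  sequence of T played points X_t
    grid:   candidate comparator points (a finite stand-in for the min over K)
    """
    player_loss = sum(f(x) for f, x in zip(losses, plays))
    # Cumulative loss of every fixed comparator point.
    cumulative = np.array([sum(f(x) for f in losses) for x in grid])
    return player_loss - cumulative.min()

# Hypothetical loss sequence: f_t(x) = (x - c_t)^2 with illustrative centers.
centers = [0.2, 0.8, 0.5, 0.5]
losses = [lambda x, c=c: (x - c) ** 2 for c in centers]
grid = np.linspace(0.0, 1.0, 101)

regret_half = regret(losses, np.full(len(losses), 0.5), grid)  # always play 0.5
regret_zero = regret(losses, np.zeros(len(losses)), grid)      # always play 0.0
print(regret_half, regret_zero)
```

Playing the minimizer of the cumulative loss on every round gives (numerically) zero regret, while a poor fixed choice pays the full gap to the best point in hindsight.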

We now turn to introduce the Bayesian online optimization setting. In the Bayesian
setting, we assume that the sequence of loss functions $F_{1:T}$, where each
$F_t : \mathcal{K} \to [0,1]$ is convex, is drawn from a probability distribution
$\mathcal{F}$ called the prior distribution. Note that $\mathcal{F}$ is a distribution over
the entire sequence of losses, and not over individual functions in the sequence. Therefore,
it can encode arbitrary dependencies between the loss functions on different rounds. However,
we assume that this distribution is known to the player, and can be used to design her policy.
The player's Bayesian regret is defined as

$$R(\rho_{1:T}; \mathcal{F}) = \mathbb{E}\left[\sum_{t=1}^T F_t(X_t) - \sum_{t=1}^T F_t(X^\star)\right],$$

where $X^\star$ is the point in $\mathcal{K}$ with the smallest cumulative loss at the end of
the game, namely the random variable

$$X^\star = \arg\min_{x \in \mathcal{K}} \sum_{t=1}^T F_t(x). \tag{1}$$

The difficulty of online optimization in a Bayesian environment is measured using the
maximin Bayesian regret, defined as

$$\sup_{\mathcal{F}} \min_{\rho_{1:T}} R(\rho_{1:T}; \mathcal{F}).$$

In words, the maximin Bayesian regret is the regret of an optimal Bayesian strategy over
the worst possible prior $\mathcal{F}$.

It turns out that the two online optimization settings we described above are closely
related. The following theorem, which is a consequence of a generalization of the von Neumann
minimax theorem, shows that the minimax adversarial regret and maximin Bayesian regret 
are equal. 

Theorem 2. It holds that

$$\min_{\rho_{1:T}} \sup_{f_{1:T}} R(\rho_{1:T}; f_{1:T}) = \sup_{\mathcal{F}} \min_{\rho_{1:T}} R(\rho_{1:T}; \mathcal{F}).$$

For completeness, we include a proof of this fact in Appendix A. As a result, instead of
analyzing the minimax regret directly, we can analyze the maximin Bayesian regret. That
is, our new goal is to design a prior-dependent player policy that guarantees a small regret
against any prior distribution $\mathcal{F}$.

2.1 Bayesian Analysis with Full Feedback 

As a warm-up, we first consider the Bayesian setting where the player receives full feedback.
Namely, on round $t$, after the player draws a point $X_t \sim \pi_t$ and incurs a loss of
$F_t(X_t)$, we assume that she observes the entire loss function $F_t$ as feedback. We show
how minimax duality can be used to recover the known $O(\sqrt{T})$ regret bounds for this
setting. For simplicity, we focus on the concrete setting where $\mathcal{K} = \Delta_n$ (the
$n$-dimensional simplex), and where the convex loss functions $F_{1:T}$ are also 1-Lipschitz
with respect to the $L_1$-norm (with probability one).

The evolution of the game is specified by a filtration $\mathcal{H}_{1:T}$, where each
$\mathcal{H}_t$ denotes the history observed by the player up to and including round $t$ of
the game; formally, $\mathcal{H}_t$ is the sigma-field generated by the random variables
$X_{1:t}$ and $F_{1:t}$. To simplify notation, we use the shorthand
$\mathbb{E}_t[\cdot] = \mathbb{E}[\cdot \mid \mathcal{H}_{t-1}]$ to denote expectation
conditioned on the history before round $t$. The analogous shorthands $\mathbb{P}_t(\cdot)$
and $\mathrm{Var}_t(\cdot)$ are defined similarly.

Recall that the player's policy can rely on the prior $\mathcal{F}$. A natural deterministic
policy is to choose, based on the random variable $X^\star$ defined in Eq. (1), actions
$X_{1:T}$ according to

$$\forall\, t \in [T], \qquad X_t = \mathbb{E}_t[X^\star]. \tag{2}$$

In other words, the player uses her knowledge of the prior and her observations so far to
calculate a posterior distribution over loss functions, and then chooses the expected
best-point-in-hindsight. Notice that the sequence $X_{1:T}$ is a martingale (in fact, a Doob
martingale), whose elements are vectors in the simplex.

The following lemma shows that the expected instantaneous (Bayesian) regret of the
strategy on each round $t$ can be upper bounded in terms of the variation of the sequence
$X_{1:T}$ on that round.

Lemma 3. Assume that with probability one, the loss functions $F_{1:T}$ are convex and
1-Lipschitz with respect to some norm $\|\cdot\|$. Then the strategy defined in Eq. (2)
guarantees $\mathbb{E}[F_t(X_t) - F_t(X^\star)] \le \mathbb{E}[\|X_t - X_{t+1}\|]$ for all $t$.

Proof. By the subgradient inequality, we have
$F_t(X_t) - F_t(X^\star) \le \nabla F_t(X_t) \cdot (X_t - X^\star)$ for all $t$. The Lipschitz
assumption implies that $\|\nabla F_t(X_t)\|_* \le 1$, where $\|\cdot\|_*$ is the dual norm of
$\|\cdot\|$. Using Eq. (2), noting that $X_t, F_t \in \mathcal{H}_t$, and taking the
conditional expectation, we get

$$\mathbb{E}_{t+1}[F_t(X_t) - F_t(X^\star)] \le \nabla F_t(X_t) \cdot (X_t - \mathbb{E}_{t+1}[X^\star]) = \nabla F_t(X_t) \cdot (X_t - X_{t+1}).$$

Finally, applying Hölder's inequality on the right-hand side and taking expectations proves
the lemma. □


To bound the total variation of $X_{1:T}$, we use a bound of Neyman (2013) on the total
variation of martingales in the simplex.

Lemma 4 (Neyman, 2013). For any martingale $Z_1, \ldots, Z_{T+1}$ in the $n$-dimensional
simplex, one has

$$\mathbb{E}\left[\sum_{t=1}^T \|Z_t - Z_{t+1}\|_1\right] \le \sqrt{\tfrac{1}{2}\, T \log n}.$$


Lemma 4 and Lemma 3 together yield an $O(\sqrt{T \log n})$ bound on the maximin Bayesian
regret of online convex optimization on the simplex with full feedback. Theorem 2 then
implies the same bound on the minimax regret in the corresponding adversarial setting,
recovering the well-known bounds in this case (e.g., Kivinen and Warmuth, 1997). We
remark that essentially the same technique can be used to retrieve known dimension-free
regret bounds in the Euclidean setting, e.g., when $\mathcal{K}$ is a Euclidean ball and the
losses are Lipschitz with respect to the $L_2$ norm; in this case, the $L_2$ total variation
of the martingale $X_{1:T}$ can be shown to be bounded by $O(\sqrt{T})$ with no dependence on
$n$.^2


2.2 Regret Analysis of Bayesian Bandits 

The analysis in this section builds on the technique introduced by Russo and van Roy (2014).
While their analysis is stated for prior distributions that are i.i.d. (namely, $\mathcal{F}$
is a product distribution), we show that it extends to arbitrary prior distributions with
essentially no modifications.

We begin by restricting our attention to finite decision sets $\mathcal{K}$, and denote
$K = |\mathcal{K}|$. (When we get to the analysis of Bayesian bandit convex optimization,
$\mathcal{K}$ will be an appropriately chosen grid of points in $[0,1]$.) In the bandit case,
the history $\mathcal{H}_t$ is the sigma-field generated by the random variables $X_{1:t}$ and
$F_1(X_1), \ldots, F_t(X_t)$. Following Russo and van Roy (2014), we consider the following
quantities related to the filtration $\mathcal{H}_{1:T}$:

$$r_t(x) = \mathbb{E}_t[F_t(x) - F_t(X^\star)], \qquad v_t(x) = \mathrm{Var}_t\big(\mathbb{E}_t[F_t(x) \mid X^\star]\big). \tag{3}$$


The random quantity $r_t(x)$ is the expected regret incurred by playing the point $x$ on
round $t$, conditioned on the history. Hence, the cumulative expected regret of the player
equals $\mathbb{E}[\sum_{t=1}^T r_t(X_t)]$. The random variable $v_t(x)$ is a proxy for the
information revealed about $X^\star$ by choosing the point $x$ on round $t$. Intuitively, if
the value of $F_t(x)$ varies significantly as a function of the random variable $X^\star$,
then observing the value of $F_t(x)$ should reveal much
^2 This follows from the fact that a martingale in $\mathbb{R}^n$ can always be projected to
a martingale in $\mathbb{R}^2$ with the same magnitude of increments; namely, given a
martingale $Z_1, Z_2, \ldots$ in $\mathbb{R}^n$ one can show that there exists a martingale
sequence $\tilde{Z}_1, \tilde{Z}_2, \ldots$ in $\mathbb{R}^2$ such that
$\|Z_t - Z_{t+1}\|_2 = \|\tilde{Z}_t - \tilde{Z}_{t+1}\|_2$ for all $t$.

Figure 1: An illustration of the local-to-global lemma. The $L_2$ distance between the
reference convex function $f$ and a convex function $g$ in the interval $[x^\star, x]$, where
$x^\star$ is the minimizer of $f$ and $x$ is a point such that $g(x) < f(x^\star)$, can be
lower bounded in terms of the shaded area that depicts the energy of the function $f$ in the
same interval.


information on the identity of $X^\star$. (More precisely, $v_t(x)$ is the amount of variance
in $F_t(x)$ explained by the random variable $X^\star$.)

The following lemma can be viewed as an analogue of Lemma 4 in the bandit setting. 


Lemma 5. For any player strategy and any prior distribution $\mathcal{F}$, it holds that

$$\mathbb{E}\left[\sum_{t=1}^T \sqrt{\mathbb{E}_t[v_t(X_t)]}\right] \le \sqrt{\tfrac{1}{2}\, T \log K}.$$


The proof uses tools from information theory to relate the quantity $v_t(X_t)$ to the decrease
in entropy of the random variable X* due to the observation on round t; the total decrease 
in entropy is necessarily bounded, which gives the bound in the lemma. For completeness, 
we give a proof in Appendix B. 

Lemma 5 suggests a generic way of obtaining regret bounds for Bayesian algorithms: first
bound the instantaneous regret $\mathbb{E}_t[r_t(X_t)]$ of the algorithm in terms of
$\sqrt{\mathbb{E}_t[v_t(X_t)]}$ for all $t$, then sum the bounds and apply Lemma 5. Russo and
van Roy (2014) refer to the ratio $\mathbb{E}_t[r_t(X_t)] / \sqrt{\mathbb{E}_t[v_t(X_t)]}$ as
the information ratio, and show that for Thompson Sampling over a set of $K$ points (under an
i.i.d. prior $\mathcal{F}$) this ratio is always bounded by $\sqrt{K}$, with no assumptions on
the structure of the functions $F_{1:T}$. In the sequel, we show that this $\sqrt{K}$ factor
can be improved to a polylogarithmic term in $K$ (albeit using a different algorithm) when
$F_{1:T}$ are univariate convex functions.


3 Leveraging Convexity: The Local-to-Global Lemma 

To obtain the desired regret bound, our analysis must take advantage of some special
property of convex functions. In this section, we specify which property of convex
functions is leveraged in our proof.

To gain some intuition, consider the following prior distribution, which is not restricted
to convex functions: draw a point $X^\star$ uniformly in $[0,1]$ and set all of the loss
functions to be the same function, $F_t(x) = \mathbf{1}\{x \ne X^\star\}$ (the indicator of
$x \ne X^\star$). Regardless of the player's policy, she will almost surely miss the point
$X^\star$, observe the loss sequence $1, \ldots, 1$, and incur a regret of $T$. The reason for
this high regret is that the prior was able to hide the good point $X^\star$ in each of the
loss functions without modifying them globally. However, if the loss functions are required to
be convex, it is impossible to design a similar example. Specifically, any local modification
to a convex function necessarily changes the function globally (namely, at many different
points). This intuitive argument is formalized in the following lemma; here we denote by
$\|g\|_\nu^2 = \int g^2 \, d\nu$ the $L_2$-norm of a function $g : [0,1] \to \mathbb{R}$
with respect to a probability measure $\nu$.

Lemma 6 (Local-to-global lemma). Let $f, g : [0,1] \to \mathbb{R}$ be convex functions.
Denote $x^\star = \arg\min_{x \in [0,1]} f(x)$ and $f^\star = f(x^\star)$, and let
$x \in [0,1]$ be such that $g(x) < f^\star < f(x)$. Then for any probability measure $\nu$
supported on $[x^\star, x]$, we have

$$\frac{\|f - g\|_\nu^2}{(f(x) - g(x))^2} \ge \nu(x^\star) \cdot \frac{\|f - f^\star\|_\nu^2}{(f(x) - f^\star)^2}.$$

To understand the statement of the lemma, it is convenient to think of $f$ as a reference
convex function, to which we compare another convex function $g$; see Fig. 1. If $g$
substantially differs from $f$ at one point $x$ (in the sense that $g(x) < f^\star$), then the
lemma asserts that $g$ must also differ from $f$ globally (in the sense that $\|f - g\|_\nu$
is large).
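The inequality of Lemma 6 can be sanity-checked numerically on a small instance. The sketch below is only an illustration under stated assumptions: it takes $f(y) = y^2$ (so $x^\star = 0$ and $f^\star = 0$), the point $x = 1$, two hypothetical linear functions $g$ with $g(1) < 0$, and a uniform measure $\nu$ on five points of $[0, 1]$. The helper name `local_to_global_sides` is made up for this example.

```python
import numpy as np

def local_to_global_sides(f, g, points, weights, x_star, x):
    """Return (lhs, rhs) of the local-to-global inequality for a discrete measure.

    lhs = ||f - g||_nu^2 / (f(x) - g(x))^2
    rhs = nu({x_star}) * ||f - f*||_nu^2 / (f(x) - f*)^2
    """
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    f_star = f(x_star)
    norm_fg = np.sum(weights * (f(points) - g(points)) ** 2)
    norm_ff = np.sum(weights * (f(points) - f_star) ** 2)
    mass_at_min = weights[np.isclose(points, x_star)].sum()
    lhs = norm_fg / (f(x) - g(x)) ** 2
    rhs = mass_at_min * norm_ff / (f(x) - f_star) ** 2
    return lhs, rhs

f = lambda y: y ** 2                      # reference convex function, minimized at 0
points = [0.0, 0.25, 0.5, 0.75, 1.0]      # support of nu inside [x_star, x] = [0, 1]
weights = [0.2] * 5                       # uniform probability measure

# Two hypothetical convex (linear) competitors that dip below f* = 0 at x = 1.
sides_a = local_to_global_sides(f, lambda y: 0.5 - y, points, weights, 0.0, 1.0)
sides_b = local_to_global_sides(f, lambda y: -0.1 * y, points, weights, 0.0, 1.0)
print(sides_a, sides_b)
```

In both cases the left-hand side dominates the right-hand side, as the lemma predicts; a check of this kind illustrates the statement but is of course no substitute for the proof.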

Proof. Let $X$ be a random variable distributed according to $\nu$. To prove the lemma, we
must show that

$$\frac{\mathbb{E}(f(X) - g(X))^2}{(f(x) - g(x))^2} \ge \mathbb{P}(X = x^\star) \cdot \frac{\mathbb{E}(f(X) - f^\star)^2}{(f(x) - f^\star)^2}.$$

Without loss of generality we can assume that $x > x^\star$. Let $x_0$ be the unique point
such that $f(x_0) = g(x_0)$, and if such a point does not exist let $x_0 = x^\star$. Note that
$x_0 < x$, and observe that $g$ is below (resp. above) $f$ on $[x_0, x]$ (resp.
$[x^\star, x_0]$).

Step 1: We first prove that, without loss of generality, one can assume that $g$ is linear.
Indeed, consider $\tilde{g}$ to be the linear extension of the chord of $g$ between $x$ and
$x_0$. Then we claim that

$$\frac{\mathbb{E}(f(X) - g(X))^2}{(f(x) - g(x))^2} \ge \frac{\mathbb{E}(f(X) - \tilde{g}(X))^2}{(f(x) - \tilde{g}(x))^2}. \tag{4}$$

Indeed, the denominator is the same on both sides of the inequality, and clearly by convexity
$\tilde{g}$ is always closer to $f$ than $g$. Thus in the following we assume that $g$ is
linear.








Step 2: We now show that one can assume $g(x) = f^\star$. Let $\tilde{g}$ be the linear
function such that $\tilde{g}(x) = f^\star$ and $\tilde{g}(x_0) = f(x_0)$. Similarly to the
previous step, we have to show that Eq. (4) holds true. We will show that
$h(y) = (f(y) - g(y)) / (f(y) - \tilde{g}(y))$ is non-increasing on $[x^\star, x]$, which
clearly implies Eq. (4). A simple approximation argument shows that without loss of generality
one can assume that $f$ is differentiable, in which case $h$ is also differentiable.
Observe that $h'(y)$ has the same sign as
$u(y) = f'(y)(g(y) - \tilde{g}(y)) - g'(y)(f(y) - \tilde{g}(y)) + \tilde{g}'(y)(f(y) - g(y))$.
Moreover, $u'(y) = f''(y)(g(y) - \tilde{g}(y))$ since $g'' = \tilde{g}'' = 0$, and thus $u$ is
decreasing on $[x_0, x]$ and increasing on $[x^\star, x_0]$ (recall that by convexity
$f''(y) \ge 0$). Since $u(x_0) \le 0$ (in fact $u(x_0) = 0$ in the case $x_0 \ne x^\star$),
this implies that $u$ is nonpositive, and thus $h$ is non-increasing, which concludes this
step.

Step 3: It remains to show that when $g$ is linear with $g(x) = f^\star$, then

$$\mathbb{E}(f(X) - g(X))^2 \ge \mathbb{P}(X = x^\star) \cdot \mathbb{E}(f(X) - f^\star)^2. \tag{5}$$

For notational convenience we assume $f^\star = 0$. By monotonicity of $f$ and $g$ on
$[x^\star, x]$, one has $|f(y) - g(y)| \ge |f(y) - f(x_0)|$ for all $y \in [x^\star, x]$.
Therefore, it holds that

$$\mathbb{E}(f(X) - g(X))^2 \ge \mathbb{E}(f(X) - f(x_0))^2 \ge \mathrm{Var}(f(X)) = \mathbb{E} f^2(X) - (\mathbb{E} f(X))^2. \tag{6}$$

Now using Cauchy-Schwarz one has
$\mathbb{E} f(X) = \mathbb{E} f(X) \mathbf{1}\{X \ne x^\star\} \le \sqrt{\mathbb{P}(X \ne x^\star) \cdot \mathbb{E} f^2(X)}$,
which together with Eq. (6) yields Eq. (5). □


4 Algorithm for Bayesian Convex Bandits 

In this section we present and analyze our algorithm for one-dimensional bandit convex
optimization in the Bayesian setting, over $\mathcal{K} = [0,1]$. Recall that in the Bayesian
setting, there is a prior distribution $\mathcal{F}$ over a sequence of loss functions over
$\mathcal{K}$, such that each function $F_t$ is convex (but not necessarily Lipschitz) and
takes values in $[0,1]$ with probability one.

Before presenting the algorithm, we make the following simplification: given $\epsilon > 0$,
we discretize the interval $[0,1]$ to a grid $\mathcal{K}_\epsilon = \{x_1, \ldots, x_K\}$ of
$K = 1/\epsilon^2$ equally-spaced points and treat $\mathcal{K}_\epsilon$ as the de facto
decision set, restricting all computations as well as the player's decisions to this finite
set. We may do so without loss of generality: it can be shown (see Appendix C) that for any
sequence of convex loss functions $F_1, \ldots, F_T : \mathcal{K} \to [0,1]$, the $T$-round
regret (of any algorithm) with respect to $\mathcal{K}$ is at most $2\epsilon T$ larger than
its regret with respect to $\mathcal{K}_\epsilon$, and we will choose $\epsilon$ to be small
enough so that this difference is negligible.
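The effect of the discretization can be illustrated on a toy instance. The sketch below assumes hypothetical losses $f_t(x) = |x - c_t|$ with randomly drawn centers, builds the grid $x_i = i/K$ with $K = 1/\epsilon^2$, and compares the best cumulative loss on the grid with the best cumulative loss on a much finer reference grid standing in for $[0,1]$. The function name `grid_gap` and the reference grid size are assumptions made for this example only.

```python
import numpy as np

def grid_gap(centers, eps, reference_size=20_001):
    """Gap between the best grid point and the (approximate) best point in [0, 1].

    centers: array of c_t defining the illustrative losses f_t(x) = |x - c_t|
    eps:     tolerance parameter; the grid has K = 1 / eps^2 points x_i = i / K
    """
    centers = np.asarray(centers, dtype=float)
    K = int(round(1.0 / eps ** 2))
    grid = np.arange(1, K + 1) / K
    reference = np.linspace(0.0, 1.0, reference_size)  # fine proxy for [0, 1]

    def best_cumulative(xs):
        # Cumulative loss sum_t |x - c_t| for every x in xs, then its minimum.
        return np.abs(xs[:, None] - centers[None, :]).sum(axis=1).min()

    return best_cumulative(grid) - best_cumulative(reference)

rng = np.random.default_rng(0)
centers = rng.uniform(0.0, 1.0, size=50)   # T = 50 rounds
eps = 0.1
gap = grid_gap(centers, eps)
print(gap, 2 * eps * len(centers))
```

For these well-behaved losses the gap is far below $2\epsilon T$; the bound stated above is designed to hold for arbitrary bounded convex losses, which need not be Lipschitz.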

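After fixing a grid $\mathcal{K}_\epsilon$, we introduce the following definitions. We define
the random variable $X^\star = \arg\min_{x \in \mathcal{K}_\epsilon} \sum_{t=1}^T F_t(x)$, and
for all $t$ and $i, j \in [K]$ let

$$\alpha_{i,t} = \mathbb{P}_t(X^\star = x_i), \qquad f_t(x_i) = \mathbb{E}_t[F_t(x_i)], \qquad f_{j,t}(x_i) = \mathbb{E}_t[F_t(x_i) \mid X^\star = x_j]. \tag{7}$$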

Inputs: prior distribution $\mathcal{F}$, tolerance parameter $\epsilon > 0$

Let $K = 1/\epsilon^2$ and $\mathcal{K}_\epsilon = \{x_1, \ldots, x_K\}$ with $x_i = i/K$ for all $i \in [K]$;

For round $t = 1$ to $T$:

    For all $i \in [K]$, compute $f_t(x_i)$ and $f_{i,t}(x_i)$ defined in Eq. (7);

    Find $i_t^\star = \arg\min_i f_t(x_i)$ and let $x_t^\star = x_{i_t^\star}$;

    Define the set

        $S_t = \{ i \in [K] : f_{i,t}(x_i) < f_t(x_t^\star) \text{ and } \alpha_{i,t} \ge \epsilon/K \}$;   (8)

    Sample $X_t$ from the distribution $\pi_t = (\pi_{1,t}, \ldots, \pi_{K,t})$ over $\mathcal{K}_\epsilon$, given by

        $\forall\, i \in [K], \quad \pi_{i,t} = \tfrac{1}{2} \alpha_{i,t} \cdot \mathbf{1}\{i \in S_t\} + (1 - \tfrac{1}{2} \alpha_t(S_t)) \cdot \mathbf{1}\{i = i_t^\star\}$,   (9)

    where we denote $\alpha_t(S) = \sum_{i \in S} \alpha_{i,t}$;

    Play $X_t$ and observe feedback $F_t(X_t)$;


Figure 2: A modified Thompson Sampling strategy that guarantees $\tilde{O}(\sqrt{T})$ expected
Bayesian regret for any prior distribution $\mathcal{F}$ over convex functions
$F_1, \ldots, F_T : [0,1] \to [0,1]$.

In words, $X^\star$ is the optimal action in hindsight, and
$\alpha_t = (\alpha_{1,t}, \ldots, \alpha_{K,t})$ is the posterior distribution of $X^\star$
on round $t$. The function $f_t : \mathcal{K}_\epsilon \to [0,1]$ is the expected loss
function on round $t$ given the feedback observed in previous rounds, and for each
$j \in [K]$, the function $f_{j,t} : \mathcal{K}_\epsilon \to [0,1]$ is the expected loss
function on round $t$ conditioned on $X^\star = x_j$ and on the history.

Using the above definitions, we can present our algorithm, shown in Fig. 2. On each round
$t$ the algorithm computes, using the knowledge of the prior $\mathcal{F}$ and the feedback
observed in previous rounds, the posterior $\alpha_t$ and the values $f_t(x_i)$ and
$f_{i,t}(x_i)$ for all $i \in [K]$. Also, it computes the minimizer $x_t^\star$ of the
expected loss $f_t$ over the set $\mathcal{K}_\epsilon$, which is the point that has the
smallest expected loss on the current round. Instead of directly sampling the decision
from the posterior $\alpha_t$ (as Thompson Sampling would do), we make the following two
simple modifications. First, we add a forced exploitation on the optimizer $x_t^\star$ of the
expected loss to ensure that the player chooses this point with probability at least
$\tfrac{1}{2}$. Second, we transfer the probability mass assigned by the posterior to points
not represented in the set $S_t$ towards $x_t^\star$. The idea is that playing a point $x_i$
with $i \notin S_t$ is useless for the player, either because it has a very low probability
mass, or because playing $x_i$ would not be (much) more profitable to the player than simply
playing $x_t^\star$ on round $t$, even if she is told that $x_i$ is the optimal point at the
end of the game.
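The two modifications to Thompson Sampling amount to a simple transformation of the posterior. The following minimal sketch computes the set $S_t$ and the sampling distribution $\pi_t$ for a single round, assuming the posterior $\alpha_t$ and the values $f_t(x_i)$ and $f_{i,t}(x_i)$ have already been computed by some other means (computing them from the prior is the hard part and is not shown). The function name, the argument names, and the numbers are hypothetical; the threshold $\epsilon/K$ and the factor $\tfrac{1}{2}$ follow Fig. 2 as read here.

```python
import numpy as np

def modified_thompson_distribution(alpha, f, f_cond_diag, eps):
    """Sampling distribution of one round of the modified Thompson Sampling rule.

    alpha[i]       : posterior probability that X* = x_i
    f[i]           : expected loss f_t(x_i) given the history
    f_cond_diag[i] : f_{i,t}(x_i), the expected loss at x_i given X* = x_i
    eps            : tolerance parameter of the algorithm
    """
    alpha = np.asarray(alpha, dtype=float)
    f = np.asarray(f, dtype=float)
    f_cond_diag = np.asarray(f_cond_diag, dtype=float)
    K = len(alpha)

    i_star = int(np.argmin(f))  # minimizer of the expected loss
    # Keep arms that look strictly better than x* when optimal, with enough mass.
    in_S = (f_cond_diag < f[i_star]) & (alpha >= eps / K)

    pi = 0.5 * alpha * in_S                        # half of the posterior mass on S
    pi[i_star] += 1.0 - 0.5 * alpha[in_S].sum()    # remaining mass goes to x*
    return pi, in_S, i_star

# Illustrative inputs for K = 4 arms (not taken from the paper).
alpha = [0.1, 0.2, 0.3, 0.4]
f = [0.5, 0.3, 0.4, 0.6]
f_cond_diag = [0.2, 0.35, 0.1, 0.25]
pi, in_S, i_star = modified_thompson_distribution(alpha, f, f_cond_diag, eps=0.1)

rng = np.random.default_rng(0)
played_index = rng.choice(len(pi), p=pi)  # draw the arm X_t ~ pi_t
print(pi, in_S, i_star, played_index)
```

By construction the arm $x_t^\star$ always receives probability at least $\tfrac{1}{2}$, and every arm outside $S_t$ other than $x_t^\star$ receives probability zero.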

The main result of this section is the following regret bound attained by our algorithm. 

Theorem 7. Let $F_1, \ldots, F_T : [0,1] \to [0,1]$ be a sequence of convex loss functions
drawn from an arbitrary prior distribution $\mathcal{F}$. For any $\epsilon > 0$, the Bayesian
regret of the algorithm described in Fig. 2 over $\mathcal{K}_\epsilon$ is upper-bounded by

$$10 \sqrt{T} \log \frac{2K}{\epsilon} + 10\, \epsilon T \sqrt{\log \frac{2K}{\epsilon}}.$$

In particular, for $\epsilon = 1/\sqrt{T}$ we obtain an upper bound of $O(\sqrt{T} \log T)$
on the regret.

Proof. We bound the Bayesian regret of the algorithm (with respect to $\mathcal{K}_\epsilon$)
on a per-round basis, via the technique described in Section 2.2. Namely, we fix a round $t$
and bound $\mathbb{E}_t[r_t(X_t)]$ in terms of $\mathbb{E}_t[v_t(X_t)]$ (see Eq. (3)). Since
the round is fixed throughout, we omit the round subscripts from our notation, and it is
understood that all variables are fixed to their state on round $t$.

First, we bound the expected regret incurred by the algorithm on round $t$ in terms of the
posterior $\alpha$ and the expected loss functions $f, f_1, \ldots, f_K$.

Lemma 8. With probability one, it holds that

$$\mathbb{E}_t[r_t(X_t)] \le \sum_{i \in S} \alpha_i (f(x_i) - f_i(x_i)) + \epsilon. \tag{10}$$

The proofs of all of our intermediate lemmas are deferred to the end of the section. Next,
we turn to lower bound the information gain of the algorithm (as defined in Eq. (3)). Recall
our notation $\|g\|_\nu$ that stands for the $L_2$-norm of a function
$g : \mathcal{K} \to \mathbb{R}$ with respect to a probability measure $\nu$ over
$\mathcal{K}$; specifically, for a measure $\nu$ supported on the finite set
$\mathcal{K}_\epsilon$ we have $\|g\|_\nu^2 = \sum_{i=1}^K \nu_i\, g^2(x_i)$.

Lemma 9. With probability one, we have

$$\mathbb{E}_t[v_t(X_t)] \ge \sum_{i \in S} \alpha_i \|f - f_i\|_\pi^2. \tag{11}$$



We now set out to relate the right-hand sides of Eqs. (10) and (11), in a way that
allows us to use Lemma 5 to bound the expected cumulative regret of the algorithm. In
order to accomplish that, we first relate each regret term $f(x_i) - f_i(x_i)$ to the
corresponding information term $\|f - f_i\|_\pi^2$. Since $f$ and the $f_i$'s are all convex
functions, this is given by the local-to-global lemma (Lemma 6), which lower-bounds the global
quantity $\|f - f_i\|_\pi^2$ in terms of the local quantity $f(x_i) - f_i(x_i)$.

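To apply the lemma, we establish some necessary definitions. For all $i \in S$, define
$\epsilon_i = \epsilon\, |x_i - x^\star|$, and let $S_i = S \cap [x_i, x^\star]$ be the
neighborhood of $x_i$ that consists of all points in $S$ lying between (and including) $x_i$
and the optimizer $x^\star$ of $f$. Now, define weights $w_i$ for all $i \in S$ as follows:

$$\forall\, i \in S, \quad w_i = \sum_{j \in S_i} \pi_j \left( \frac{f(x_j) - f(x^\star) + \epsilon_j}{f(x_i) - f(x^\star) + \epsilon_i} \right)^2, \quad \text{and} \quad w_{i^\star} = \pi_{i^\star}. \tag{12}$$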


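With these definitions, Lemma 6 can be used to prove the following.

Lemma 10. For all $i \in S$ it holds that
$\|f - f_i\|_\pi^2 \ge \tfrac{1}{4} w_i (f(x_i) - f_i(x_i))^2 - \epsilon^2$.

Now, averaging the inequality of the lemma with respect to $\alpha$ over all $i \in S$ and
using the fact that $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ for any $a, b \ge 0$, we obtain

$$\sqrt{\sum_{i \in S} \alpha_i w_i (f(x_i) - f_i(x_i))^2} \le 2 \sqrt{\sum_{i \in S} \alpha_i \|f - f_i\|_\pi^2} + 2\epsilon.$$

On the other hand, the Cauchy-Schwarz inequality gives

$$\sum_{i \in S} \alpha_i (f(x_i) - f_i(x_i)) \le \sqrt{\sum_{i \in S} \frac{\alpha_i}{w_i}} \cdot \sqrt{\sum_{i \in S} \alpha_i w_i (f(x_i) - f_i(x_i))^2}.$$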

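Combining the two inequalities and recalling Lemmas 8 and 9, we get

$$\mathbb{E}_t[r_t(X_t)] \le 2 \sqrt{\sum_{i \in S} \frac{\alpha_i}{w_i}} \left( \sqrt{\mathbb{E}_t[v_t(X_t)]} + \epsilon \right) + \epsilon. \tag{13}$$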


It remains to upper bound the sum $\sum_{i \in S} \alpha_i / w_i$. This is accomplished in
the following lemma.

Lemma 11. We have

$$\sum_{i \in S} \frac{\alpha_i}{w_i} \le 20 \log \frac{2K}{\epsilon}.$$

Finally, plugging the bound of the lemma into Eq. (13) and using Lemma 5, yields the 
stated regret bound. □ 


4.1 Remaining Proofs 

We first give the proof of Lemma 8. Recall that for readability, we omit the subscripts 
specifying the round number t from our notation. 

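Proof. The expected instantaneous regret can be written in terms of the distributions $\pi$
and $\alpha$, and the functions $f_1, \ldots, f_K$ and $f$ as follows:

$$\mathbb{E}_t[r_t(X_t)] = \sum_{i=1}^K \pi_i\, r_t(x_i) = \sum_{i=1}^K \pi_i\, \mathbb{E}_t[F_t(x_i)] - \sum_{i=1}^K \alpha_i\, \mathbb{E}_t[F_t(x_i) \mid X^\star = x_i] = \sum_{i=1}^K \pi_i f(x_i) - \sum_{i=1}^K \alpha_i f_i(x_i).$$

Next, we consider the first sum on the right-hand side of the above, which corresponds to
the expected loss incurred by the algorithm. Since $\pi$ is obtained from $\alpha$ by
transferring probability mass towards $x^\star$ (whose loss is the smallest), the expected
loss of the algorithm satisfies

$$\sum_{i=1}^K \pi_i f(x_i) = \tfrac{1}{2} \sum_{i \in S} \alpha_i f(x_i) + (1 - \tfrac{1}{2} \alpha(S)) f(x^\star) \le \sum_{i \in S} \alpha_i f(x_i) + (1 - \alpha(S)) f(x^\star) = \sum_{i \in S} \alpha_i f(x_i) + \sum_{i \notin S} \alpha_i f(x^\star).$$

Also, since for each $i \notin S$ we either have $\alpha_i < \epsilon/K$ or
$f(x^\star) - f_i(x_i) \le 0$ (while both quantities are trivially bounded by 1),

$$\sum_{i \notin S} \alpha_i (f(x^\star) - f_i(x_i)) \le \epsilon.$$

Hence, for the regret we have

$$\mathbb{E}_t[r_t(X_t)] = \sum_{i=1}^K \pi_i f(x_i) - \sum_{i=1}^K \alpha_i f_i(x_i) \le \sum_{i \in S} \alpha_i (f(x_i) - f_i(x_i)) + \sum_{i \notin S} \alpha_i (f(x^\star) - f_i(x_i)) \le \sum_{i \in S} \alpha_i (f(x_i) - f_i(x_i)) + \epsilon. \qquad \Box$$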

Next, we prove Lemma 9. 

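Proof. The expected instantaneous information gain can be written as

$$\mathbb{E}_t[v_t(X_t)] = \sum_{j=1}^K \pi_j\, \mathrm{Var}_t\big(\mathbb{E}_t[F_t(x_j) \mid X^\star]\big) = \sum_{i=1}^K \sum_{j=1}^K \alpha_i \pi_j \big(\mathbb{E}_t[F_t(x_j) \mid X^\star = x_i] - \mathbb{E}_t[F_t(x_j)]\big)^2 = \sum_{i=1}^K \sum_{j=1}^K \alpha_i \pi_j (f_i(x_j) - f(x_j))^2.$$

The lemma then follows from

$$\sum_{i=1}^K \sum_{j=1}^K \alpha_i \pi_j (f(x_j) - f_i(x_j))^2 \ge \sum_{i \in S} \alpha_i \sum_{j=1}^K \pi_j (f(x_j) - f_i(x_j))^2 = \sum_{i \in S} \alpha_i \|f - f_i\|_\pi^2. \qquad \Box$$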
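The variance decomposition used in this proof is easy to evaluate for a finite posterior. The sketch below assumes a hypothetical posterior $\alpha$ and a hypothetical table of conditional expected losses $f_j(x_i)$; it computes $v(x_i) = \sum_j \alpha_j (f_j(x_i) - f(x_i))^2$ with $f = \sum_j \alpha_j f_j$, and then the expected information term $\sum_i \pi_i v(x_i)$ for an illustrative sampling distribution $\pi$.

```python
import numpy as np

def explained_variance(alpha, f_cond):
    """v(x_i) = Var(E[F(x_i) | X*]) for a finite posterior over X*.

    alpha[j]     : posterior probability that X* = x_j
    f_cond[j, i] : f_j(x_i) = E[F(x_i) | X* = x_j]
    """
    alpha = np.asarray(alpha, dtype=float)
    f_cond = np.asarray(f_cond, dtype=float)
    f_mean = alpha @ f_cond                    # f(x_i) = sum_j alpha_j f_j(x_i)
    return alpha @ ((f_cond - f_mean) ** 2)    # sum_j alpha_j (f_j(x_i) - f(x_i))^2

# Illustrative two-arm example (values are made up).
alpha = np.array([0.5, 0.5])
f_cond = np.array([[0.0, 1.0],
                   [1.0, 1.0]])
v = explained_variance(alpha, f_cond)

pi = np.array([0.75, 0.25])                    # some sampling distribution
expected_information = pi @ v                  # E_t[v_t(X_t)] = sum_i pi_i v(x_i)
print(v, expected_information)
```

Here the first arm's loss depends on which arm is optimal, so observing it is informative, while the second arm's loss is the same under both hypotheses and carries no information.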




We now turn to prove Lemma 10. The proof uses the local-to-global lemma (Lemma 6) 
discussed earlier in Section 3. 

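Proof. The lemma holds trivially for $i = i^\star$, as we defined $w_{i^\star} = \pi_{i^\star}$,
whence

$$\|f - f_{i^\star}\|_\pi^2 \ge \pi_{i^\star} (f(x^\star) - f_{i^\star}(x^\star))^2 \ge \tfrac{1}{4} w_{i^\star} (f(x^\star) - f_{i^\star}(x^\star))^2 - \epsilon^2.$$

Therefore, in what follows we assume that $i \in S$ and $i \ne i^\star$.

Consider a regularized version of the function $f$, given by
$f_\epsilon(x) = f(x) + \epsilon\, |x - x^\star|$. Notice that $f_\epsilon$ is convex, and has
a unique minimum at $x^\star$ with $f_\epsilon(x^\star) = f(x^\star)$. Since
$\pi_{i^\star} \ge \tfrac{1}{2}$ by construction (the algorithm exploits with probability
$\tfrac{1}{2}$), and for all $i \in S$ we have
$f_i(x_i) < f_\epsilon(x^\star) < f_\epsilon(x_i)$, we can apply Lemma 6 to the functions
$f_\epsilon$ and $f_i$ and obtain

$$\frac{\sum_{j \in S_i} \pi_j (f_\epsilon(x_j) - f_i(x_j))^2}{(f_\epsilon(x_i) - f_i(x_i))^2} \ge \frac{1}{2} \cdot \frac{\sum_{j \in S_i} \pi_j (f_\epsilon(x_j) - f_\epsilon(x^\star))^2}{(f_\epsilon(x_i) - f_\epsilon(x^\star))^2}.$$

Now, notice that $f_\epsilon(x_j) - f_\epsilon(x^\star) = f(x_j) - f(x^\star) + \epsilon\, |x_j - x^\star|$
for all $j$; hence, recalling Eq. (12), the right-hand side above equals $\tfrac{1}{2} w_i$.
Rearranging and using
$\|f_\epsilon - f_i\|_\pi^2 \ge \sum_{j \in S_i} \pi_j (f_\epsilon(x_j) - f_i(x_j))^2$ gives

$$\|f_\epsilon - f_i\|_\pi^2 \ge \tfrac{1}{2} w_i (f_\epsilon(x_i) - f_i(x_i))^2. \tag{14}$$

To obtain the lemma from Eq. (14), observe that by the triangle inequality,

$$\|f - f_i\|_\pi \ge \|f_\epsilon - f_i\|_\pi - \|f_\epsilon - f\|_\pi \ge \|f_\epsilon - f_i\|_\pi - \epsilon,$$

so using $(a + b)^2 \le 2(a^2 + b^2)$ we can upper bound the left-hand side of Eq. (14) as
$\|f_\epsilon - f_i\|_\pi^2 \le 2 \|f - f_i\|_\pi^2 + 2\epsilon^2$. On the right-hand side of
the same inequality, we can use the lower bound
$(f_\epsilon(x_i) - f_i(x_i))^2 \ge (f(x_i) - f_i(x_i))^2$ that follows from
$f_\epsilon(x_i) - f_i(x_i) \ge f(x_i) - f_i(x_i) \ge 0$. Combining these observations with
Eq. (14), we have $\|f - f_i\|_\pi^2 \ge \tfrac{1}{4} w_i (f(x_i) - f_i(x_i))^2 - \epsilon^2$,
which concludes the proof. □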

Finally, we prove Lemma 11. 

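Proof. Since $\alpha_i \le 2\pi_i$ for all $i \in S$, it is enough to bound the sum $\sum_{i \in S} \pi_i / w_i$. We decompose this sum into three disjoint parts: the term corresponding to $i = i^*$ (in case $i^* \in S$) that equals 1 as $w_{i^*} = \pi_{i^*}$ by definition, a sum over the indices $i \in S$ such that $x_i < x^*$, and a sum over those such that $x_i > x^*$. The proof is identical for both the latter sums, thus we focus on the set $S'$ of indices such that $x_i > x^*$. Up to reindexing, we can assume that $S' = \{1, \ldots, K'\}$ for some $K' \le K$, and the corresponding points are such that $x^* < x_1 < \ldots < x_{K'}$. By our definition of $w_i$ (see Eq. (12)), we have

$$
\sum_{i \in S'} \frac{\pi_i}{w_i} \;=\; \sum_{i=1}^{K'} \frac{\pi_i \big(f(x_i) - f(x^*) + \epsilon_i\big)^2}{\sum_{j=1}^{i} \pi_j \big(f(x_j) - f(x^*) + \epsilon_j\big)^2} .
$$

Observe that for all $i \in S'$ it holds that $\epsilon_i = \epsilon |x_i - x^*| \ge \epsilon / K$, as the points $x_1, \ldots, x_K$ lie on an equally-spaced grid of the interval (and $x_i \ne x^*$ since $i^* \notin S'$). Recall also that by construction $\pi_i \ge \epsilon / (2K)$ for all $i \in S$. Hence, we have

$$
\forall\, i \in S', \qquad \tfrac{1}{2} \Big(\frac{\epsilon}{K}\Big)^3 \;\le\; \pi_i \big(f(x_i) - f(x^*) + \epsilon_i\big)^2 \;\le\; 4 \pi_i .
$$

Now, denote $\beta_i = \sum_{j=1}^{i} \pi_j \big(f(x_j) - f(x^*) + \epsilon_j\big)^2$, for which $\tfrac{1}{2} (\epsilon / K)^3 \le \beta_1 \le \beta_2 \le \ldots \le \beta_{K'} \le 4$. Thus, we have

$$
\sum_{i=1}^{K'} \frac{\pi_i}{w_i}
\;=\; 1 + \sum_{i=2}^{K'} \frac{\beta_i - \beta_{i-1}}{\beta_i}
\;\le\; 1 + \sum_{i=2}^{K'} \log \frac{\beta_i}{\beta_{i-1}}
\;=\; 1 + \log \frac{\beta_{K'}}{\beta_1} ,
$$

where the inequality follows from the fact that $\log z \le z - 1$ for $0 < z \le 1$, applied with $z = \beta_{i-1} / \beta_i$. Since $\beta_{K'} / \beta_1 \le (2K/\epsilon)^3$, we can bound the right-hand side by $1 + 3 \log (2K/\epsilon)$. The lemma now follows from applying the same bound to the other part of the total sum (over the indices $i$ such that $x_i < x^*$) and recalling the possible term corresponding to $i = i^*$. □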
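The telescoping step in the proof above can be checked numerically. The following is a minimal sketch, not part of the original analysis: the helper name `telescoping_sums` and the random increments are hypothetical, with each increment standing in for a term $\pi_i (f(x_i) - f(x^*) + \epsilon_i)^2$, so that the running sums play the role of $\beta_i$.

```python
import math
import random

def telescoping_sums(increments):
    """For positive increments b_1..b_n with beta_i = b_1 + ... + b_i, return
    lhs = sum_i b_i / beta_i   and   rhs = 1 + log(beta_n / beta_1)."""
    beta = 0.0
    beta_first = None
    lhs = 0.0
    for b in increments:
        beta += b
        if beta_first is None:
            beta_first = beta          # beta_1, so the first term equals 1
        lhs += b / beta                # (beta_i - beta_{i-1}) / beta_i
    rhs = 1.0 + math.log(beta / beta_first)
    return lhs, rhs

rng = random.Random(0)                 # fixed seed, illustrative data only
increments = [rng.uniform(1e-6, 1.0) for _ in range(50)]
lhs, rhs = telescoping_sums(increments)
print("sum of ratios:", lhs, "log bound:", rhs)
```

Each ratio $(\beta_i - \beta_{i-1})/\beta_i$ is at most $\log(\beta_i/\beta_{i-1})$, so the printed sum should never exceed the printed bound, whatever positive increments are used.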


5 Discussion and Open Problems 

We proved that the minimax regret of adversarial one-dimensional bandit convex optimization is $\widetilde{\Theta}(\sqrt{T})$ by designing an algorithm for the analogous Bayesian setting and then using minimax duality to upper-bound the regret in the adversarial setting. Our work raises interesting open problems. The main open problem is whether one can generalize our analysis from the one-dimensional case to higher dimensions (say, even n = 2). While much of our analysis generalizes to higher dimensions, the key ingredient of our proof, namely the local-to-global lemma (Lemma 6), is inherently one-dimensional. We hope that the components of our analysis, and especially the local-to-global lemma, will inspire the design of efficient algorithms for adversarial bandit convex optimization, even though our end result is a non-constructive bound.

The Bayesian algorithm used in our analysis is a modified version of the classic Thompson Sampling strategy. A second open question is whether or not the same regret guarantee can be obtained by vanilla Thompson Sampling, without any modification. However, if it turns out that unmodified Thompson Sampling is sufficient, the proof is likely to be more complex: our analysis is greatly simplified by the observation that the instantaneous regret of our algorithm is controlled by its instantaneous information gain on each and every round, a claim that does not hold for Thompson Sampling.

Finally, we note that our reasoning together with Proposition 5 of Russo and van Roy (2014) allows one to recover effortlessly Theorem 4 of Bubeck et al. (2012), which gives the worst-case minimax regret for online linear optimization with bandit feedback on a discrete set in $\mathbb{R}^n$. It would be interesting to see if this proof strategy also allows one to exploit the geometric structure of the point set. For instance, could the techniques described here give an alternative proof of Theorem 6 of Bubeck et al. (2012)?

A Proof of Theorem 2

The proof relies on Sion’s generalization of von Neumann’s minimax theorem, which we state 
here for completeness (see Corollary 3.3 of Sion, 1958, or Komiya, 1988). 

Theorem 12. Let $X$ and $Y$ be convex sets in two linear topological spaces, and suppose that $X$ is compact. Let $f$ be a real-valued function on $X \times Y$ such that

(i) $f(x, \cdot)$ is upper semicontinuous and concave on $Y$ for each $x \in X$;

(ii) $f(\cdot, y)$ is lower semicontinuous and convex on $X$ for each $y \in Y$.

Then,

$$
\min_{x \in X} \sup_{y \in Y} f(x, y) \;=\; \sup_{y \in Y} \min_{x \in X} f(x, y) .
$$

Proof of Theorem 2. For a metric space $A$ we denote by $\Delta(A)$ the set of Borel probability measures on $A$. Let $\mathcal{C}$ be the space of convex functions from the compact $\mathcal{K}$ to $[0,1]$. A deterministic player's strategy is specified by a sequence of operators $a_1, \ldots, a_T$, where in the full information case $a_s : \mathcal{C}^{s-1} \to \mathcal{K}$, and in the bandit case $a_s : [0,1]^{s-1} \to \mathcal{K}$. We denote by $\mathcal{A}$ the set of such sequences of operators, which is compact in the product topology. The minimax regret can be written as:

$$
\min_{u \in \Delta(\mathcal{A})} \; \sup_{f_{1:T} \in \mathcal{C}^T} \mathbb{E}[R_T] , \qquad (15)
$$

where $R_T$ denotes the induced $T$-round regret, and the expectation is with respect to the random draw of a player's strategy from $u$. Using Sion's minimax theorem, we deduce that Eq. (15) is equal to

$$
\sup_{\mathcal{F} \in \Delta(\mathcal{C}^T)} \; \min_{u \in \Delta(\mathcal{A})} \mathbb{E}[R_T] , \qquad (16)
$$

where the expectation is with respect to both the random draw of a player's strategy from $u$, and the random draw of the sequence of losses from $\mathcal{F}$. Finally, to convert the statement above to the statement of Theorem 2, we invoke Kuhn's theorem on the payoff equivalence of behavioral strategies to general randomized strategies. More precisely, we apply the continuum version of this theorem, established by Aumann (1964). □


B Information Theoretic Analysis of Bayesian Algorithms


In this section we prove Lemma 5, restated here. 


Lemma 5 (Russo and van Roy, 2014). For any player strategy and any prior distribution $\mathcal{F}$, it holds that

$$
\mathbb{E}\left[ \sum_{t=1}^{T} \sqrt{\mathbb{E}_t[v_t(X_t)]} \right] \;\le\; \sqrt{\tfrac{1}{2} T \log K} .
$$

The proof follows the analysis of Russo and van Roy (2014). For the proof, we require the following definition. Let

$$
\forall\, x \in \mathcal{K}, \qquad I_t(x) = I_t\big(F_t(x); X^*\big)
$$

be the mutual information between $X^*$ and the player's loss on round $t$ upon choosing the action $x \in \mathcal{K}$, conditioned on the history $\mathcal{H}_{t-1}$ (thus, $I_t(x)$ is a random variable, measurable with respect to $\mathcal{H}_{t-1}$). Intuitively, $I_t(x)$ is the expected amount of information on $X^*$ revealed by playing $x$ on round $t$ of the game and observing the feedback $F_t(x)$.

Before proving Lemma 5, we first show an analogous claim for the information terms $I_t(X_t)$.

Lemma 13. We have

$$
\mathbb{E}\left[ \sum_{t=1}^{T} \sqrt{\mathbb{E}_t[I_t(X_t)]} \right] \;\le\; \sqrt{T \log K} .
$$


Proof. Let us examine how the entropy of the random variable $X^*$ evolves during the game as the player gathers the observations $F_1(X_1), \ldots, F_T(X_T)$. Denoting by $H_t(\cdot)$ the entropy conditional on $\mathcal{H}_{t-1}$, we have by standard information theoretic relations,

$$
I_t(x) = I_t\big(F_t(x); X^*\big) = \mathbb{E}_t\big[ H_t(X^*) - H_{t+1}(X^*) \mid X_t = x \big]
$$

for all points $x \in \mathcal{K}$. Thus,

$$
\mathbb{E}_t[I_t(X_t)] = \mathbb{E}_t\big[ H_t(X^*) - H_{t+1}(X^*) \big] .
$$

Summing over $t$ and taking expectations, we obtain

$$
\mathbb{E}\left[ \sum_{t=1}^{T} \mathbb{E}_t[I_t(X_t)] \right]
= \sum_{t=1}^{T} \mathbb{E}\big[ H_t(X^*) - H_{t+1}(X^*) \big]
\le \mathbb{E}[H_1(X^*)] = H(X^*) .
$$

Using Cauchy-Schwarz and the concavity of the square root yields

$$
\mathbb{E}\left[ \sum_{t=1}^{T} \sqrt{\mathbb{E}_t[I_t(X_t)]} \right]
\;\le\; \sqrt{T} \cdot \sqrt{ \mathbb{E}\left[ \sum_{t=1}^{T} \mathbb{E}_t[I_t(X_t)] \right] }
\;\le\; \sqrt{H(X^*)\, T} .
$$

Recalling that the entropy of any random variable supported on $K$ atoms is upper bounded by $\log K$, the lemma follows. □
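The two inequalities used in this proof can be illustrated on synthetic numbers. The sketch below is only a sanity check under stated assumptions: the prior over $K$ atoms is random, and the per-round information gains `gains` are hypothetical nonnegative values chosen to sum to the prior entropy (the largest total the entropy argument allows).

```python
import math
import random

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0)

rng = random.Random(0)
K, T = 8, 100

raw = [rng.random() + 1e-3 for _ in range(K)]
total_raw = sum(raw)
prior = [r / total_raw for r in raw]         # hypothetical prior of X*
h = entropy(prior)                           # H(X*) <= log K

# Hypothetical per-round information gains, nonnegative and summing to H(X*).
parts = [rng.random() for _ in range(T)]
total_parts = sum(parts)
gains = [h * p / total_parts for p in parts]

lhs = sum(math.sqrt(g) for g in gains)       # sum_t sqrt(gain_t)
mid = math.sqrt(T * sum(gains))              # Cauchy-Schwarz upper bound
rhs = math.sqrt(T * math.log(K))             # entropy bound H(X*) <= log K
print(lhs, mid, rhs)
```

The printed values should be nondecreasing from left to right, mirroring the chain of inequalities in the proof.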


We can now prove Lemma 5. 

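Proof of Lemma 5. Let $\alpha_t \in \Delta(\mathcal{K})$ be the posterior distribution of $X^*$ given $\mathcal{H}_{t-1}$, with $\alpha_{t,x} = \mathbb{P}_t(X^* = x)$ for all $x \in \mathcal{K}$. By the definition of mutual information, for all $x \in \mathcal{K}$,

$$
I_t(x) = I_t\big(F_t(x); X^*\big) = \sum_{y \in \mathcal{K}} \alpha_{t,y} \, D_{\mathrm{KL}}\big( Q_{x|y} \,\|\, Q_x \big) ,
$$

where $Q_x$ is the distribution of $F_t(x)$ conditioned on $\mathcal{H}_{t-1}$, and $Q_{x|y}$ is the distribution of $F_t(x)$ conditioned on $\mathcal{H}_{t-1}$ and the event $X^* = y$. Applying Pinsker's inequality on each term on the right-hand side of the above, we obtain

$$
\begin{aligned}
\tfrac{1}{2} I_t(x)
&\ge \sum_{y \in \mathcal{K}} \alpha_{t,y} \big( \mathbb{E}_{Q_{x|y}}[F_t(x)] - \mathbb{E}_{Q_x}[F_t(x)] \big)^2 \\
&= \sum_{y \in \mathcal{K}} \alpha_{t,y} \big( \mathbb{E}_t[F_t(x) \mid X^* = y] - \mathbb{E}_t[F_t(x)] \big)^2 \\
&= \mathrm{Var}_t\big( \mathbb{E}_t[F_t(x) \mid X^*] \big) \\
&= v_t(x) .
\end{aligned}
$$

Hence, $v_t(x) \le \tfrac{1}{2} I_t(x)$ for all $x \in \mathcal{K}$, which implies that $\mathbb{E}_t[v_t(X_t)] \le \tfrac{1}{2} \mathbb{E}_t[I_t(X_t)]$ with probability one. Combining this with Lemma 13, the result follows. □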
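The Pinsker step can be sanity-checked on a small discrete example. This is an illustrative sketch only: the posterior `alpha`, the conditional laws `cond`, and the loss support `values` (a subset of $[0,1]$) are hypothetical inputs, not quantities from the paper.

```python
import math
import random

def mutual_info_and_variance(alpha, cond, values):
    """alpha[y]   : posterior weight of the event X* = y
    cond[y][k] : P(F = values[k] | X* = y), with values[k] in [0, 1]
    Returns (I, v): the mutual information I(F; X*) in nats and the
    variance v of the conditional mean E[F | X*]."""
    n = len(values)
    marginal = [sum(a * q[k] for a, q in zip(alpha, cond)) for k in range(n)]
    mean = sum(m * val for m, val in zip(marginal, values))
    info, var = 0.0, 0.0
    for a, q in zip(alpha, cond):
        kl = sum(qk * math.log(qk / mk) for qk, mk in zip(q, marginal) if qk > 0)
        cond_mean = sum(qk * val for qk, val in zip(q, values))
        info += a * kl                       # sum_y alpha_y * KL(Q_{x|y} || Q_x)
        var += a * (cond_mean - mean) ** 2   # sum_y alpha_y * (mean gap)^2
    return info, var

def normalize(ws):
    s = sum(ws)
    return [w / s for w in ws]

rng = random.Random(1)
values = [0.0, 0.25, 0.5, 1.0]
alpha = normalize([rng.random() + 0.01 for _ in range(3)])
cond = [normalize([rng.random() + 0.01 for _ in values]) for _ in alpha]
info, var = mutual_info_and_variance(alpha, cond, values)
print("v =", var, " I/2 =", info / 2)
```

Because the losses lie in $[0,1]$, the gap between two means is at most the total variation distance, and Pinsker's inequality then gives `var <= info / 2` for any such inputs.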


C Effective Lipschitz Property of Convex Functions 

In this section we show that any convex function is essentially Lipschitz, and justify the simplifying discretization made in Section 4. The required property is summarized in the following lemma.

Lemma 14. Let $\mathcal{K} \subseteq \mathbb{R}^n$ be a convex set that contains a ball of radius $r$, and let $f$ be a convex function over $\mathcal{K}$ that takes values in $[0,1]$. Then for a $\delta$-net $\mathcal{X}$ of $\mathcal{K}$, with $\delta \le \tfrac{1}{4} r \epsilon^2$, it holds that $\min_{x \in \mathcal{X}} f(x) \le \min_{x \in \mathcal{K}} f(x) + \epsilon$.

In particular, for the unit interval $[0,1]$ it is enough to take a grid with $K = O(1/\epsilon^2)$ equally-spaced points to have an $\epsilon$-approximation to the optimum of any convex function over $[0,1]$ taking values in $[0,1]$. In fact, to obtain the same $\epsilon$-approximation property it is enough to use a more compact grid of size $O(\tfrac{1}{\epsilon} \log \tfrac{1}{\epsilon})$, whose points are not equally spaced; see Appendix D below for more details.

Lemma 14 is a consequence of the following simple property of convex functions, observed by Flaxman et al. (2005).

Lemma 15. Let $\mathcal{K} \subseteq \mathbb{R}^n$ be a convex set that contains a ball of radius $r$ centered at the origin, and denote $\mathcal{K}_\epsilon = (1 - \epsilon)\mathcal{K}$. Let $f$ be a convex function over $\mathcal{K}$ such that $0 \le f(x) \le C$ for all $x \in \mathcal{K}$. Then

(i) for any $x \in \mathcal{K}_\epsilon$ and $y \in \mathcal{K}$, it holds that $|f(x) - f(y)| \le \frac{C}{r\epsilon} \|x - y\|$;

(ii) $\min_{x \in \mathcal{K}_\epsilon} f(x) \le \min_{x \in \mathcal{K}} f(x) + C\epsilon$.

Proof of Lemma 14. Via a simple shift of the space, we can assume without loss of generality that $\mathcal{K}$ contains a ball of radius $r$ centered at the origin. Let $z = \arg\min_{x \in \mathcal{K}} f(x)$ and $y = \arg\min_{x \in \mathcal{K}'} f(x)$, where $\mathcal{K}' = (1 - \tfrac{\epsilon}{2})\mathcal{K}$. By the definition of the $\delta$-net $\mathcal{X}$, there exists a point $x \in \mathcal{X}$ for which $\|x - y\| \le \delta$. Since $y \in \mathcal{K}'$ and $x \in \mathcal{K}$, part (i) of Lemma 15 shows that $f(x) - f(y) \le \frac{2}{r\epsilon} \delta \le \frac{\epsilon}{2}$. On the other hand, part (ii) of the same lemma says that $f(y) - f(z) \le \frac{\epsilon}{2}$. Combining the inequalities we now get $f(x) \le f(z) + \epsilon = \min_{x \in \mathcal{K}} f(x) + \epsilon$, which gives the lemma. □
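For the unit interval, Lemma 14 can be illustrated with a short numerical check. The sketch below makes its own assumptions: it takes $r = 1/2$, builds an equally spaced grid whose half-spacing is at most $\epsilon^2/8$, and tests it on a hypothetical family of V-shaped convex functions (`v_shape`) whose minimum is 0 and may sit very close to the boundary, where the slope is steep.

```python
import math

def uniform_grid(eps):
    """Equally spaced grid on [0, 1] that is a delta-net with
    delta = 1/(2n) <= eps**2 / 8, i.e. Lemma 14 with r = 1/2."""
    n = math.ceil(4.0 / eps ** 2)        # spacing 1/n
    return [i / n for i in range(n + 1)]

def v_shape(c):
    """Convex function on [0, 1] with values in [0, 1] and minimum 0 at x = c
    (a maximum of two linear pieces); the slope is 1/c on the left of c."""
    return lambda x: max(1.0 - x / c, (x - c) / (1.0 - c))

eps = 0.1
grid = uniform_grid(eps)
# Gap between the best grid value and the true minimum (which is 0).
gaps = [min(v_shape(c)(x) for x in grid) for c in (0.0013, 0.3337, 0.9991)]
print(len(grid), gaps)
```

Every gap should be at most `eps`, even for minimizers very close to 0 or 1, at the cost of a grid with on the order of $1/\epsilon^2$ points.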

D Constructive Upper Bound in One Dimension 

Here we describe an explicit and efficient one-dimensional algorithm for bandit convex optimization with general (possibly non-Lipschitz) convex loss functions over $\mathcal{K} = [0,1]$, whose regret performance is better than the general $O(T^{5/6})$ bound of Flaxman et al. (2005) that applies in an arbitrary dimension. The algorithm is based on the Exp3 strategy for online learning with bandit feedback over a finite set of $K$ points (i.e., arms), whose expected regret is bounded by $\widetilde{O}(\sqrt{TK})$; see Auer et al. (2002) for further details on the algorithm and its analysis.

In order to use Exp3 in our continuous setting, where the decision set is $[0,1]$, we form an appropriate discretization of the interval. It turns out that using a uniform, equally-spaced grid is suboptimal and can only give an algorithm whose expected regret is of order $\widetilde{O}(T^{3/4})$. Nevertheless, by using a specially-tailored grid of the interval we can obtain an improved $\widetilde{O}(T^{2/3})$ bound; this customized grid is specified in the following lemma.

Lemma 16. For $0 < \epsilon \le 1$, define $x_k = \epsilon (1 + \epsilon)^k$ for all $k \ge 0$. Then the set $\mathcal{X}_\epsilon = \{x_k, 1 - x_k\}_{k=0}^{\infty} \cap [0,1]$ satisfies:

(i) $|\mathcal{X}_\epsilon| = O(\tfrac{1}{\epsilon} \log \tfrac{1}{\epsilon})$;

(ii) for any convex function $f : [0,1] \to [0,1]$, we have $\min_{x \in \mathcal{X}_\epsilon} f(x) \le \min_{x \in [0,1]} f(x) + 2\epsilon$.

Proof. To see the first claim, note that for $k \ge \tfrac{2}{\epsilon} \log \tfrac{1}{\epsilon}$ we have $x_k = \epsilon (1 + \epsilon)^k \ge \epsilon \exp(\tfrac{1}{2} k \epsilon) \ge 1$, where we have used the fact that $e^x \le 1 + 2x$ for $0 \le x \le 1$.

Next, we prove that for any $y \in [\epsilon, 1 - \epsilon]$, there exists $x \in \mathcal{X}_\epsilon$ such that $|f(y) - f(x)| \le \epsilon$; this would imply our second claim, as by Lemma 15 the minimum of $f$ over $[\epsilon, 1 - \epsilon]$ can only be $\epsilon$ larger than its minimum over the entire $[0,1]$ interval. We focus on the case $y \in (0, \tfrac{1}{2}]$; the case $y \in [\tfrac{1}{2}, 1)$ is treated similarly. Then, we have $y \in [y, 1 - y]$ so Lemma 15 shows that for any $x \in [0,1]$ we have

$$
|f(x) - f(y)| \;\le\; \frac{1}{y} |x - y| \;=\; \left| \frac{x}{y} - 1 \right| . \qquad (17)
$$

Now, let $k$ be the unique natural number such that $x_k \le y < x_{k+1}$. Notice that $1 < x_{k+1}/y \le 1 + \epsilon$, since $x_{k+1}/x_k = 1 + \epsilon$. Hence, setting $x = x_{k+1}$ in Eq. (17) yields $|f(x) - f(y)| \le \left| \frac{x_{k+1}}{y} - 1 \right| \le \epsilon$, as required. □
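The grid of Lemma 16 is easy to construct and test. The following sketch is illustrative: `geometric_grid` and the V-shaped test functions `v_shape` are hypothetical helpers, and the size check uses $\tfrac{4}{\epsilon}\log\tfrac{1}{\epsilon}$ (twice the threshold on $k$ from the proof), which is an assumption about the constant rather than a value taken from the lemma.

```python
import math

def geometric_grid(eps):
    """Points x_k = eps * (1 + eps)**k and their mirror images 1 - x_k,
    restricted to [0, 1]."""
    points = set()
    x = eps
    while x <= 1.0:
        points.add(x)
        points.add(1.0 - x)
        x *= 1.0 + eps
    return sorted(points)

def v_shape(c):
    """Convex function on [0, 1] with values in [0, 1] and minimum 0 at x = c."""
    return lambda x: max(1.0 - x / c, (x - c) / (1.0 - c))

eps = 0.1
grid = geometric_grid(eps)
# Gap between the best grid value and the true minimum (which is 0).
gaps = [min(v_shape(c)(x) for x in grid) for c in (0.0013, 0.3337, 0.9991)]
print(len(grid), gaps)
```

Compared with an equally spaced grid of about $1/\epsilon^2$ points, this grid is much smaller because it is dense only near the two endpoints, where a bounded convex function can be steep.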

In view of the lemma, the algorithm we propose is straightforward: given a parameter $\epsilon > 0$, form a grid of the interval $[0,1]$ as described in the lemma, and execute the Exp3 algorithm over the finite set $\mathcal{X}_\epsilon$.
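The procedure just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a standard loss-based Exp3 update with an assumed learning rate $\eta = \sqrt{2 \log K / (TK)}$, and the loss sequence built from `v_shape` is a hypothetical test input.

```python
import math
import random

def geometric_grid(eps):
    """Grid of Lemma 16: x_k = eps*(1+eps)**k and 1 - x_k, kept inside [0, 1]."""
    points, x = set(), eps
    while x <= 1.0:
        points.update((x, 1.0 - x))
        x *= 1.0 + eps
    return sorted(points)

def exp3(grid, losses, seed=0):
    """Run Exp3 (loss-based variant) over the finite set `grid`.
    losses is a list of functions f_t : [0,1] -> [0,1]; on each round only
    the value f_t(x_t) at the played point is used (bandit feedback)."""
    rng = random.Random(seed)
    K, T = len(grid), len(losses)
    eta = math.sqrt(2.0 * math.log(K) / (T * K))   # assumed standard tuning
    cum_est = [0.0] * K        # cumulative importance-weighted loss estimates
    played, total = [], 0.0
    for f in losses:
        low = min(cum_est)     # shift for numerical stability
        weights = [math.exp(-eta * (c - low)) for c in cum_est]
        z = sum(weights)
        probs = [w / z for w in weights]
        i = rng.choices(range(K), weights=probs)[0]
        loss = f(grid[i])
        cum_est[i] += loss / probs[i]              # unbiased loss estimate
        played.append(grid[i])
        total += loss
    return played, total

def v_shape(c):
    """Convex function on [0, 1] with values in [0, 1] and minimum 0 at x = c."""
    return lambda x: max(1.0 - x / c, (x - c) / (1.0 - c))

T = 2000
eps = T ** (-1.0 / 3.0)
grid = geometric_grid(eps)
losses = [v_shape(0.02 if t % 2 == 0 else 0.03) for t in range(T)]
played, total = exp3(grid, losses)
best_fixed = min(sum(f(x) for f in losses) for x in grid)
print("regret against the best grid point:", total - best_fixed)
```

The printed quantity is the regret against the best point of the grid only; by Lemma 16, the best grid point is itself within $2\epsilon$ per round of the best point in $[0,1]$.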

Theorem 17. The algorithm described above with $\epsilon = T^{-1/3}$ guarantees $\widetilde{O}(T^{2/3})$ regret against any sequence $f_{1:T}$ of convex (not necessarily Lipschitz) functions over $\mathcal{K} = [0,1]$ taking values in $[0,1]$.

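Proof. Let $x_{1:T}$ be the sequence of points from $\mathcal{X}_\epsilon$ chosen by Exp3. By the regret guarantee of Exp3, we have

$$
\mathbb{E}\left[ \sum_{t=1}^{T} f_t(x_t) \right] - \min_{x \in \mathcal{X}_\epsilon} \sum_{t=1}^{T} f_t(x)
\;=\; \widetilde{O}\big(\sqrt{TK}\big) \;=\; \widetilde{O}\big(\sqrt{T/\epsilon}\big) ,
$$

where $K = |\mathcal{X}_\epsilon|$, which according to Lemma 16 satisfies $K = O(\tfrac{1}{\epsilon} \log \tfrac{1}{\epsilon})$. On the other hand, since the average of the $f_t$ is itself a convex function taking values in $[0,1]$, Lemma 16 also ensures that

$$
\min_{x \in \mathcal{X}_\epsilon} \frac{1}{T} \sum_{t=1}^{T} f_t(x) \;\le\; \min_{x \in \mathcal{K}} \frac{1}{T} \sum_{t=1}^{T} f_t(x) + 2\epsilon .
$$

Combining the inequalities we get the following regret bound with respect to the entire $\mathcal{K}$:

$$
\mathbb{E}\left[ \sum_{t=1}^{T} f_t(x_t) \right] - \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x)
\;=\; \widetilde{O}\big( \sqrt{T/\epsilon} + \epsilon T \big) .
$$

Finally, choosing $\epsilon = T^{-1/3}$ gives the theorem.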


□ 
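The choice of $\epsilon$ can be illustrated by evaluating the two competing terms of the bound. The sketch below ignores constants and logarithmic factors; the function names are hypothetical, and the uniform-grid variant assumes $K = 1/\epsilon^2$ points as in Appendix C.

```python
import math

def geometric_grid_bound(T, eps):
    """sqrt(T / eps) + eps * T: Exp3 term with K ~ 1/eps, plus discretization."""
    return math.sqrt(T / eps) + eps * T

def uniform_grid_bound(T, eps):
    """sqrt(T * K) + eps * T with K = 1 / eps**2 equally spaced points."""
    return math.sqrt(T) / eps + eps * T

T = 10 ** 6
geo = geometric_grid_bound(T, T ** (-1.0 / 3.0))   # both terms equal T^(2/3)
uni = uniform_grid_bound(T, T ** (-0.25))          # both terms equal T^(3/4)
print(geo, uni)
```

With $\epsilon = T^{-1/3}$ the two terms of the geometric-grid bound balance at $T^{2/3}$, while the best balance available to the uniform grid, at $\epsilon = T^{-1/4}$, is only $T^{3/4}$.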
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